Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-2532] Consolidated shuffle fixes #1609

Closed
wants to merge 10 commits into from
Closed

[SPARK-2532] Consolidated shuffle fixes #1609

wants to merge 10 commits into from

Conversation

mridulm
Copy link
Contributor

@mridulm mridulm commented Jul 27, 2014

Status of the PR

  • Cherry pick and merge changes from internal branch to spark master
  • Remove WIP comments and 2G branch references.
  • Tests for BlockObjectWriter
  • Tests for ExternalAppendOnlyMap
  • Tests for ShuffleBlockManager

@SparkQA
Copy link

SparkQA commented Jul 27, 2014

QA tests have started for PR 1609. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17242/consoleFull

@SparkQA
Copy link

SparkQA commented Jul 27, 2014

QA results for PR 1609:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17242/consoleFull

@SparkQA
Copy link

SparkQA commented Jul 27, 2014

QA tests have started for PR 1609. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17243/consoleFull

@SparkQA
Copy link

SparkQA commented Jul 27, 2014

QA results for PR 1609:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17243/consoleFull

@SparkQA
Copy link

SparkQA commented Jul 27, 2014

QA tests have started for PR 1609. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17244/consoleFull

@SparkQA
Copy link

SparkQA commented Jul 27, 2014

QA results for PR 1609:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17244/consoleFull


// val confCopy = conf.clone
// // Ensure we always write data after object ser
// confCopy.set("spark.serializer.objectStreamReset", "1")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sure to remove commented out debug code like this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is already removed - waiting for tests to pass locally :-)

@mateiz
Copy link
Contributor

mateiz commented Jul 27, 2014

@adav @andrewor14 would be good if you two take a look at this when it's merging correctly.

@SparkQA
Copy link

SparkQA commented Jul 27, 2014

QA tests have started for PR 1609. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17256/consoleFull

@andrewor14
Copy link
Contributor

Did you mean @aarondav?

@SparkQA
Copy link

SparkQA commented Jul 28, 2014

QA results for PR 1609:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17256/consoleFull

initialized = true
wasOpenedOnce = true;
} finally {
if (! initialized) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One more spaces

@aarondav
Copy link
Contributor

@mridulm Thanks for submitting this! I would like to dig a little deeper into understanding the specific issues you found, in order to understand the solutions you have provided (since the specific solutions seem interleaved with a lot of new asserts and code paths).

You mention that there was an issue if shuffle writes co-occur with shuffle fetches, which is true, but should not typically occur due to the barrier before the reduce stage of a shuffle. In what situations does this happen (outside of failure conditions)?

Did you observe a prior pattern of close/revert/close on the same block writer?

How did task failures induce inconsistent state on the map side? Was it due to the same close/revert/close pattern?

@aarondav
Copy link
Contributor

I created #1678, which only includes the changes directly related to fixing the issues with shuffle file consolidation (essentially forking off a piece of this PR), intended as a simple candidate for review to make the 1.1 release. The smaller PR is not intended as a replacement for this more complete one, however -- it is merely an option to fix some of the more severe bugs in time for the next major release. If we can get this one in for 1.1 instead, then we should.

aarondav added a commit to aarondav/spark that referenced this pull request Jul 31, 2014
All changes from this PR are by @mridulm and are drawn from his work in apache#1609.
This patch is intended to fix all major issues related to shuffle file consolidation
that @mridulm found, while minimizing changes to the code, with the hope that it may
be more easily merged into 1.1.

This patch is **not** intended as a replacement for apache#1609, which provides many
additional benefits, including fixes to ExternalAppendOnlyMap, improvements to
DiskBlockObjectWriter's API, and several new unit tests.

If it is feasible to merge apache#1609 for the 1.1 deadline, that is a preferable option.
*/
object Java7Util {
def isSymlink(file: File) = java.nio.file.Files.isSymbolicLink(file.toPath)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this still compile on Java 6? I don't see any changes to pom.xml or anything to exclude it. Or maybe did you mean to call this by reflection?

@mateiz
Copy link
Contributor

mateiz commented Jul 31, 2014

So FYI I'm going to make a more detailed pass through this soon to see if we can get all of it into 1.1. It would be nice to get all these changes in so we can QA them along with the other QA we do for 1.1, but if that doesn't work out, we can split some of them into smaller patches as Aaron did.

@mateiz
Copy link
Contributor

mateiz commented Aug 1, 2014

So I looked through this and I also think it would be good to split it into smaller patches for 1.1. As far as I can see there are several orthogonal improvements here:

  • Shuffle file consolidation fixes that Aaron copied in SPARK-2791: Fix committing, reverting and state tracking in shuffle file consolidation #1678
  • ExternalAppendOnlyMap fixes to deal with writes past end of stream; we also need these in ExternalSorter
  • Fixes to directory creation in DiskBlockManager (I'm still not sure when this would be a problem actually if all accesses to these directories are through getFile; needs some investigation)
  • Fixes to isSymlink (though as is this seems like it would only compile on Java 7)
  • Improvements to the API of DiskBlockObjectWriter

Of these, the first two are most critical. So I'd like to get those into 1.1, and then we can do API refactoring and the other fixes on the master branch. For the directory creation fix I'd still like to understand when that can be a problem (I'm probably just missing something), but it's also one we can add in 1.1 during the QA window.

I'm going to update the JIRA to create sub-tasks for these things so we can track where each one is fixed. Thanks again for putting this together Mridul, this is very helpful.

asfgit pushed a commit that referenced this pull request Aug 1, 2014
…ile consolidation

All changes from this PR are by mridulm and are drawn from his work in #1609. This patch is intended to fix all major issues related to shuffle file consolidation that mridulm found, while minimizing changes to the code, with the hope that it may be more easily merged into 1.1.

This patch is **not** intended as a replacement for #1609, which provides many additional benefits, including fixes to ExternalAppendOnlyMap, improvements to DiskBlockObjectWriter's API, and several new unit tests.

If it is feasible to merge #1609 for the 1.1 deadline, that is a preferable option.

Author: Aaron Davidson <[email protected]>

Closes #1678 from aarondav/consol and squashes the following commits:

53b3f6d [Aaron Davidson] Correct behavior when writing unopened file
701d045 [Aaron Davidson] Rebase with sort-based shuffle
9160149 [Aaron Davidson] SPARK-2532: Minimal shuffle consolidation fixes
mateiz added a commit to mateiz/spark that referenced this pull request Aug 1, 2014
All these changes are from @mridulm's work in apache#1609, but extracted here
to fix this specific issue. This particular set of changes is to make
sure that we read exactly the right range of bytes from each spill file
in EAOM: some serializers can write bytes after the last object (e.g.
the TC_RESET flag in Java serialization) and that would confuse the
previous code into reading it as part of the next batch. There are also
improvements to cleanup to make sure files are closed.
@mateiz
Copy link
Contributor

mateiz commented Aug 1, 2014

Opened #1722 to do the second fix (map batch writing) for 1.1, including applying the same fix to ExternalSorter.

asfgit pushed a commit that referenced this pull request Aug 4, 2014
…in ExternalMap / Sorter

All these changes are from mridulm's work in #1609, but extracted here to fix this specific issue and make it easier to merge not 1.1. This particular set of changes is to make sure that we read exactly the right range of bytes from each spill file in EAOM: some serializers can write bytes after the last object (e.g. the TC_RESET flag in Java serialization) and that would confuse the previous code into reading it as part of the next batch. There are also improvements to cleanup to make sure files are closed.

In addition to bringing in the changes to ExternalAppendOnlyMap, I also copied them to the corresponding code in ExternalSorter and updated its test suite to test for the same issues.

Author: Matei Zaharia <[email protected]>

Closes #1722 from mateiz/spark-2792 and squashes the following commits:

5d4bfb5 [Matei Zaharia] Make objectStreamReset counter count the last object written too
18fe865 [Matei Zaharia] Update docs on objectStreamReset
576ee83 [Matei Zaharia] Allow objectStreamReset to be 0
0374217 [Matei Zaharia] Remove super paranoid code to close file handles
bda37bb [Matei Zaharia] Implement Mridul's ExternalAppendOnlyMap fixes in ExternalSorter too
0d6dad7 [Matei Zaharia] Added Mridul's test changes for ExternalAppendOnlyMap
9a78e4b [Matei Zaharia] Add @mridulm's fixes to ExternalAppendOnlyMap for batch sizes
asfgit pushed a commit that referenced this pull request Aug 4, 2014
…in ExternalMap / Sorter

All these changes are from mridulm's work in #1609, but extracted here to fix this specific issue and make it easier to merge not 1.1. This particular set of changes is to make sure that we read exactly the right range of bytes from each spill file in EAOM: some serializers can write bytes after the last object (e.g. the TC_RESET flag in Java serialization) and that would confuse the previous code into reading it as part of the next batch. There are also improvements to cleanup to make sure files are closed.

In addition to bringing in the changes to ExternalAppendOnlyMap, I also copied them to the corresponding code in ExternalSorter and updated its test suite to test for the same issues.

Author: Matei Zaharia <[email protected]>

Closes #1722 from mateiz/spark-2792 and squashes the following commits:

5d4bfb5 [Matei Zaharia] Make objectStreamReset counter count the last object written too
18fe865 [Matei Zaharia] Update docs on objectStreamReset
576ee83 [Matei Zaharia] Allow objectStreamReset to be 0
0374217 [Matei Zaharia] Remove super paranoid code to close file handles
bda37bb [Matei Zaharia] Implement Mridul's ExternalAppendOnlyMap fixes in ExternalSorter too
0d6dad7 [Matei Zaharia] Added Mridul's test changes for ExternalAppendOnlyMap
9a78e4b [Matei Zaharia] Add @mridulm's fixes to ExternalAppendOnlyMap for batch sizes
@andrewor14
Copy link
Contributor

@mridulm Are the issues in this PR taken care of by #1722 and and #1678? Do we still need this PR?

@mridulm
Copy link
Contributor Author

mridulm commented Sep 4, 2014

I think it got split into four issues, two of which got committed, not sure
of the other other two .... And the first one was regressed upon in
1.1.already.
But this or probably is defunct now .... Will close
On 04-Sep-2014 5:03 am, "andrewor14" [email protected] wrote:

@mridulm https://github.com/mridulm Are the issues in this PR taken
care of by #1722 #1722 and and #1678
#1678? Do we still need this PR?


Reply to this email directly or view it on GitHub
#1609 (comment).

@mridulm mridulm closed this Sep 4, 2014
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014
…ile consolidation

All changes from this PR are by mridulm and are drawn from his work in apache#1609. This patch is intended to fix all major issues related to shuffle file consolidation that mridulm found, while minimizing changes to the code, with the hope that it may be more easily merged into 1.1.

This patch is **not** intended as a replacement for apache#1609, which provides many additional benefits, including fixes to ExternalAppendOnlyMap, improvements to DiskBlockObjectWriter's API, and several new unit tests.

If it is feasible to merge apache#1609 for the 1.1 deadline, that is a preferable option.

Author: Aaron Davidson <[email protected]>

Closes apache#1678 from aarondav/consol and squashes the following commits:

53b3f6d [Aaron Davidson] Correct behavior when writing unopened file
701d045 [Aaron Davidson] Rebase with sort-based shuffle
9160149 [Aaron Davidson] SPARK-2532: Minimal shuffle consolidation fixes
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014
…in ExternalMap / Sorter

All these changes are from mridulm's work in apache#1609, but extracted here to fix this specific issue and make it easier to merge not 1.1. This particular set of changes is to make sure that we read exactly the right range of bytes from each spill file in EAOM: some serializers can write bytes after the last object (e.g. the TC_RESET flag in Java serialization) and that would confuse the previous code into reading it as part of the next batch. There are also improvements to cleanup to make sure files are closed.

In addition to bringing in the changes to ExternalAppendOnlyMap, I also copied them to the corresponding code in ExternalSorter and updated its test suite to test for the same issues.

Author: Matei Zaharia <[email protected]>

Closes apache#1722 from mateiz/spark-2792 and squashes the following commits:

5d4bfb5 [Matei Zaharia] Make objectStreamReset counter count the last object written too
18fe865 [Matei Zaharia] Update docs on objectStreamReset
576ee83 [Matei Zaharia] Allow objectStreamReset to be 0
0374217 [Matei Zaharia] Remove super paranoid code to close file handles
bda37bb [Matei Zaharia] Implement Mridul's ExternalAppendOnlyMap fixes in ExternalSorter too
0d6dad7 [Matei Zaharia] Added Mridul's test changes for ExternalAppendOnlyMap
9a78e4b [Matei Zaharia] Add @mridulm's fixes to ExternalAppendOnlyMap for batch sizes
sunchao pushed a commit to sunchao/spark that referenced this pull request Jun 2, 2023
…h BosonExec enabled (apache#1609)

(cherry picked from commit e843d83b83e7351ed3c610a1ac9f016fe5732cb3)
Signed-off-by: Dongjoon Hyun <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants